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Abstract 

Linear-scaling electronic-structure techniques, also called O(N) techniques, rely heavily on the mul- 
tiplication of sparse matrices, where the sparsity arises from spatial cut-offs. In order to treat very large 
systems, the calculations must be run on parallel computers. We analyse the problem of parallelising the 
multiplication of sparse matrices with the sparsity pattern required by linear-scaling techniques. We show 
that the management of inter-node communications and the effective use of on-node cache are helped by 
organising the atoms into compact groups. We also discuss how to identify a key part of the code called 
the 'multiplication kernel', which is repeatedly invoked to build up the matrix product, and explicit code 
is presented for this kernel. Numerical tests of the resulting multiplication code are reported for cases of 
practical interest, and it is shown that their scaling properties for systems containing up to 20,000 atoms 
on machines having up to 512 processors are excellent. The tests also show that the cpu efficiency of our 
code is satisfactory. 

1 Introduction 

There is rapidly growing interest in linear-scaling methods for electronic structure calculations, in which 
the memory and number of cpu cycles are proportional to the number of atoms [jjj. Several practical codes 
exist for performing tight-binding first-principles |J, || or 



Hartree-Fock |18|, 19, [20| calculations using linear-scaling techniques - often called order- N or O(N) 
techniques. For very large systems containing thousands or tens of thousands of atoms, these codes need 
to run efficiently on parallel computers. Most of the existing linear-scaling methods rely heavily on the 
multiplication of sparse matrices, and this means that sparse-matrix multiplication on parallel computers is 
a key issue for the future of linear-scaling electronic structure work. To achieve the best efficiency, parallel 
codes to perform this multiplication must be designed to treat the special patterns of sparsity that arise 
in electronic structure, which we shall refer to as 'local sparsity'. The aims of this paper are to analyse 
the design of parallel code for multiplying matrices having local sparsity, to present the algorithms we have 
developed for the Conquest electronic structure code ^3|, and to report the results of practical tests 
on the efficiency and linear-scaling properties of the code. 

In the tight-binding approach to electronic structure, we are concerned with the Hamiltonian and overlap 
matrices Hi a ,j0 and Siajp- (Here, i, j label atoms, while a, (3 label the basis functions on each atom.) Linear- 
scaling methods generally require also the density matrix A^ aj( 3 and sometimes also an 'auxiliary' density 
matrix Li a - we follow the notation of Ref. |24[ . Linear-scaling methods based on density-functional theory 
(DFT) or Hartree-Fock theory work with similar quantities, and the DFT-pseudopotential approach also 
needs scalar products Pi a ,j\ between localised orbitals on atom i and angular-momentum projectors for non- 
local pseudopotcntials on atom j. The elements of all these matrices tend to zero as the distance Rij between 
atoms i and j tends to infinity, and this is what makes linear-scaling eletronic structure theory possible. In 
the approach used in the Conquest code [^2|, ||), and other related approaches (see e.g. Rcfs (?], ^, ||[ |2q), 
the approximation is made that all the matrix elements vanish exactly when exceeds a specified cut-off 
radius, which generally differs for different matrices. Once this approximation has been made, linear-scaling 
behaviour for both memory and cpu cycles is guaranteed - provided, of course, appropriate algorithms are 
used. All the matrices are sparse, since their only non-vanishing elements are those for which is less 
than the appropriate cut-off. In everything that follows, the particular pattern of sparsity that arises from 
the spatial cut-offs is crucial, and it is this pattern that we call 'local sparsity'. 

The techniques widely used to ensure the idempotency of the density matrix K ^4|, [32| rely on the 
calculation of matrix products such as (LSL)i a jp, (LSLSL)i a jp and other similar products involving H and 
P. Sometimes, these products are needed only for interatomic distances Rij that are smaller than the full 
distances for which they have non- vanishing elements, so that we do not need to calculate all their elements. 



The general problem is therefore to perform the product: 



(1) 

fee 

where the elements of A and B are non- vanishing only for Rik < Ra and Rkj < Rb respectively, and 
elements of C are required only for Rij < Rc, where Ra, Rb, Rc are specified cut-off distances. The indices 
fi, v, £ may correspond to localised orbitals or angular momentum projectors on each atom, but in any case 
they run only over a small set of values. In general, the number of values might depend on the atom, but 
we ignore this possible complication here and assume that /i runs from 1 to rt\, v runs from 1 to n 2 , and £ 
runs from 1 to n^. Unless the distinction is important, we shall refer to n\ 1 ni or 7^3 simply as n. 

The Conquest DFT code was written from the start as a parallel code, and the methods used to 
parallelise sparse- matrix multiplications and all the other operations were reported earlier p3), together 
with tests of the scaling behaviour. Although the scaling was excellent, the major problem with the matrix 
multiplication methods that we used earlier was that they required elaborate indexing, which consumed 
excessive amounts of memory. This is why we have carried out the much more thorough analysis of locally 
sparse matrix multiplication reported here. 

There has also been other work on parallelising the multiplication of locally sparse matrices for linear- 
scaling electronic structure work, using both tight-binding [^7], |28|, ^| and Hartree-Fock techniques 
Itoh et al. |27} and Canning et al. [^] describe the parallelisation of the two closely related linear-scaling 
methods reported in Refs. 0, [2(| and |25|, and find satisfactory speed-ups of up to 7.3 for an eight- 
fold increase of processor number. Challacombe [^lf presented a very general parallelisation scheme for 
linear-scaling Hartree-Fock calculations using Gaussian basis sets, but found speed-ups of only 75 % for 
an eight-fold increase of processor number. Wang et al. made passing reference to the parallelisation of 
the tight-binding density- matrix method [j29| , but no details appear to have been published. The popular 
tight-binding linear-scaling scheme known as the Fermi-operator method (h]] has also been parallelised, but 
this relies on matrix-vector rather than matrix-matrix operations, and is less relevant here. 

In considering the efficiency of a multiplication code, we note that there is a certain irreducible minimum 
number of multiply-and-adds that must be done to accomplish the calculation of eqn For a parallel 
machine whose processors have a given peak speed, this sets a well defined lower bound to the execution 
time. The ratio between this lower bound and the practical execution time gives a measure of the cpu 
efficiency of the code. The superscalar processors of main interest to us can sometimes achieve 50 % of peak 
speed in the multiplication of large matrices, but it would be unrealistic to expect this kind of efficiency for 
sparse matrices. On the other hand, it would be cause for concern if the cpu efficiency fell much below 10 %. 
We shall show that efficiencies of around 10 % are indeed achievable. We shall show that the methods we 
propose achieve scaling very close to linear, both as the numbers of atoms and processors are increased for 
a fixed number of atoms per processor, and as the number of processors is increased for a fixed number of 
atoms. 

We shall see that achievement of high efficiency and good scalability relies heavily on two things: first, 
good design of storage patterns of matrix elements; second, the use of different schemes for labelling atoms 
in different parts of the calculation, together with efficient transcription between schemes. For both storage 
and labelling, we shall show the advantages of grouping atoms into small spatially compact groups (here 
called 'partitions'), and assignment of groups of these partitions (here called partition 'bundles') to individual 
processors. 

The next section of the paper presents our analysis of the problem, and the set of algorithms that 
the analysis leads us to. We also present the piece of code that lies at the heart of our multiplication 
program, which we refer to as the 'multiplication kernel'. Our practical code is written in Fortran 90, with 
communications handled by MPI1. In Sec. 3, we outline the main principles used in programming our 
algorithms. Sec. 4 reports the range of tests on a Cray T3E, an SGI Origin2000 and a beowulf cluster that 
we have performed to probe the efficiency and scaling behaviour of the multiplication code. The final Section 
discusses our results and draws conclusions. In particular, we shall argue that other possible approaches to 
the problem of multiplying locally sparse matrices on parallel machines are unlikely to achieve significantly 
better efficiency or linear scaling behaviour. 

2 Development of techniques 

In order to achieve high efficiency, we need careful design of a small set of operations that are repeatedly 
performed to build up the product matrix: this set of operations is the 'multiplication kernel'. Some 



of the data that serve as input to the kernel need to be communicated from remote nodes, and linear 
scaling will only be achieved if the communications strategy is properly designed. Both the multiplication 
kernel and the communications need indexing, and it is important that both the memory and the cpu time 
needed to manipulate index arrays are kept to a minimum. Before discussing the multiplication kernel, 
the communications strategy and the indexing, we outline first the starting assumptions and the spatial 
organisation of atoms into partitions and bundles. 

For ease of presentation, we confine ourselves in much of the following to the case Rc > Ra + Rb, which 
we refer to as the 'maximal' case; it is the case where every element of the product matrix C = A ■ B must 
be calculated. At the end of this Section, we shall show how the case of a general relation between Ra, Rb 
and Rc can be handled by exactly the same principles that will be described for the maximal case. We 
also avoid initially any discussion of periodic boundary conditions. The Conquest code is able to handle 
periodic boundary conditions, though this may not always be needed, so that the matrix multiplication code 
must be capable of including periodicity. But to simplify the discussion, we defer discussion of periodicity 
till Sec. O 



2.1 Assumptions 

Following Ref. f22| , we assume that each node is responsible for a group of atoms: we call this the node's 
'primary set' of atoms. We also assume that matrix elements are stored by rows, so that for any matrix 
X, the elements Xj M -„ for all /x, j, v are held by the node whose primary set contains i. Finally, we 
assume that the multiplication operations needed to calculate the elements Ci MJ „ of the product matrix are 
performed by the node whose primary set contains i. This means that only matrix elements Bui.jv need to 
be communicated between nodes. We discuss in Sec. |^ whether anything could be gained by changing these 
assumptions. 

With these assumptions, we already encounter a key question, concerning the interleaving of communica- 
tions and calculations. The matrix multiplications performed by each node can be broken into contributions, 
with each contribution associated with a particular set of atoms k. For each set, we fetch the B^j V data for 
k in the set, and then perform the multiply- and- adds of eqn ([!]) for those k, accumulating the results onto 
the array of the product matrix; then we move to the next set of k atoms and repeat the communications 
and calculations. Two extremes can be envisaged. At one extreme, we start by bringing all the B^.ju for all 
the k entering eqn (jl]) from the remote nodes onto the local node; after this has been done, we then perform 
all the multiplications. This is the coarsest possible interleaving. At the other extreme, we fetch the B^ju 
for just a single atom k at a time, and perform the multiplications for that k, before moving on to the next 
k. This is the finest possible interleaving. Neither extreme will be satisfactory. The coarse extreme requires 
an unnecessarily large amount of local memory to store all the B^^ v elements; the finest extreme requires 
the transmission of unnecessarily small packets of B^.ju data, so that latency will slow the communications. 
Somewhere between the extremes, there is a good compromise between memory and latency. 

In fact, there is no point in communicating all B^ j V data before multiplying: the coarsest interleaving 
that we should ever consider is where we communicate all the B^ju needed from a particular remote node 
before performing the multiplications; then we move to the next node. This interleaving already yields 
the lowest possible latency. It follows that the outermost loop in matrix multiplication must be a loop 
over remote nodes. However, there is no reason why this coarsest interleaving should give an acceptable 
compromise, and we shall generally want to reduce the memory requirement further by breaking the atoms 
k into smaller sets. Within the outer loop over nodes, we shall therefore want a loop over sets of atoms k in 
each particular node. 

Our basic assumptions about the distribution of matrix elements over nodes therefore require the general 
interleaving of communications and calculations shown in Fig. fil. 



2.2 Partitions, bundles, haloes and the outline multiplication scheme 

The primary sets of atoms should be chosen to be spatially compact: the atoms should all be near each other. 
The reason for this is that compactness of the primary set helps to reduce the number of atoms k entering 
equation (|l|) for any given node, so that fewer B^j v elements have to be communicated. Put another way, 
primary set compactness means that each B^ju communicated is used more times. Since the time spent in 
communications limits the parallel scalability, compactness help scalability. 

However, it would limit flexibility if we took the primary sets to be the smallest organisational unit of 
atoms. The size of the primary sets depends on the number of processors used, and this may depend on 
unpredictable circumstances. So we need a smaller and more stable organisational unit. We call this a 
'partition'. A partition is a small, spatially compact set of atoms that does not depend on the number of 



processors. Each primary set consists of a spatially compact bundle of partitions. Clearly, in assembling 
partitions to make primary sets, a key consideration will be load balancing, so we shall need each primary set 



to contain roughly the same number of atoms. Load balancing will be discussed in more detail in Sec. |2.10 . 

Partitions could in principle be formed in many ways. At present, the simulation cell used in the 
Conquest code is required to be orthorhombic (this requirement will, of course, be removed in the future) . 
We divide each of the three edges of this cell using a uniform grid, and the orthorhombic subcells thus 
formed are used as partitions. For some arrangements of atoms - for example, slabs of crystal surrounded 
by vacuum - there may be partitions that contain no atoms. This creates no problem, though for good load 
balancing we shall require that no primary sets should consist only of empty partitions. 

To discuss the outline multiplication scheme in terms of partitions, we now introduce some terminology, 
which is illustrated schematically in Fig. ^[ For each atom i in the primary set, there are atoms k in the 
system whose distance from i is less than Ra; we call these the A-neighbours of i. The atoms k which are 
A-neighbours of at least one atom in the primary set form a set which we call the A-halo. We refer to atoms 
in this set as the A-halo atoms. The set of partitions containing at least one A-halo atom are called the 
A-halo partitions. Finally, the set of nodes responsible for at least one A-halo partition are called the A-halo 
nodes. 

The multiplication scheme can now be expressed thus: We loop over A-halo nodes. For each such node, 
we loop over groups of atoms k, communicate the -Bfcjjv elements for atoms in the group, perform the 
corresponding multiplications, and accumulate onto the C-matrix array. The groups of atoms k will now 
be identified as A-halo partitions, so that the inner loop in Fig. [I] is a loop over the A-halo partitions on 
the given A-halo node. Partitions therefore serve two purposes: first, they aid in the flexible construction of 
primary sets; second, they provide the sub-units needed in the communication of i?-matrix data. 

A number of issues are raised by the partition scheme. We must consider the size of f?-data packets to 
be communicated, and how to avoid fetching unwanted _B-data. This question is intimately related to the 
choice and structure of the multiplication kernel. It is also related to the question of the indexing needed so 
that the multiplication kernel knows which atom pairs are associated with each piece of i?-data. Finally, we 
must consider the indexing needed by the multiplication kernel in order to accumulate C. Since the nature 
of the multiplication kernel lies at the heart of these issues, we address this next. 

2.3 Choice and structure of the multiplication kernel 

As explained above, the multiplication kernel is a small set of operations repeatedly performed on each node, 
to build up the rows of the product matrix Ci^^ v for which the node is responsible. More precisely, it is the 
set of operations on which detailed coding effort will be concentrated in order to maximise the efficiency. 
The choice of kernel is therefore crucial. 

Of the choices that might be made, we can envisage two extremes, which resemble those mentioned 



when we discussed the interleaving of communications and calculations (Sec. 2A). At one extreme, we could 
choose the kernel to consist of the multiplication of the n x n matrices associated with each triplet of atoms 
(i,k,j). (Recall that m, Ti2, TI3 are the numbers of indices on the atoms i, j, fc; to simplify the discussion, 
we refer to them without distinction as n.) If we made this choice, the kernel would sit inside a triple loop 
over k, i and j. We refer to this as the 'fine' choice. At the other extreme is the 'coarse' choice, where the 
kernel consists of the entire triple loop over k, i and j, and everything inside it. Neither extreme is likely 
to be satisfactory. We note that the fine choice will ensure good reuse of data in registers, as discussed in 
Ref. p8f . In linear-scaling calculations, the dimension n will be rather small; in treating s-p semiconductors, 
for example, it will often have the value 4, so that the data for a given triplet will generally fit into registers 
and primary cache. However, each data item is used only n times, and the use of secondary cache will be 
poor. On the other hand, the coarse extreme is too undifferentiated. Since the full set of A, B and C data 
for all i, j, k will not fit into secondary cache in any practical situation, there is no point in making this 
choice of kernel. Clearly, the operations should be broken into small groups, but these should not be as small 
as (i, k,j) triplets. 

An obvious intermediate choice of kernel is to take it as the set of operations associated with a given k. 
For a given k, the label i goes over all A- neighbours of k in the primary set, and j goes over all -B-neighbours 
of k in the system. This will be excellent for cache reuse of B^.ju, since j will run over B- neighbours of k 
in the sequence in which they are stored. The same thing will happen for A, provided the Ai^^ elements 
are stored by blocks associated with a given k - effectively this means that we store the 'local transpose' of 
A. If we denote the number of .B-neighbours of k by Nu and the number of primary A-neighbours of k by 
then potentially every element of A in cache can be used nNj? times, and every element of B in cache 
can be used nM^ times. 



However, this still disregards cache reuse of C. In order to help this, we should take the kernel to be 
the set of operations associated with a small group of atoms k, and this group must clearly be spatially 
compact. We identify this group as the partition. A great advantage of this choice of kernel is that it fits 
naturally with our proposed interleaving scheme for communications (Sec. |2.l| ). The overall scheme is now 
that each node loops over its A-halo partitions. For each partition, £?-data is communicated for the entire 
partition. The multiplication kernel then performs all the multiply-and-adds and the accumulation of C for 
A-halo atoms k in the current partition. 

We note in passing how the partition scheme helps efficient data transfer from memory. The j atoms 
that we loop over are grouped into the spatially compact partitions used to organise the storage of B. Since 
the C-matrix is also stored with the j-index grouped into these partitions, and since superscalar processors 
generally transfer data from main memory into cache in contiguous sets (cache-lines) this should considerably 
enhance the chances of finding C-matrix elements in cache when they are needed. The underlying thought 
here is that the partition scheme helps to improve 'data locality', as discussed by Goedecker |33|. 

We implied just now that, when £>-data is fetched from the remote node, we communicate the B^j V 
for all k in the current A-halo partition. Since these k are not necessarily all in the A-halo, this means 
that in general not all B^ju communicated are actually used, so that there is a waste of communications. 
We do this, since we believe that it is better to tolerate the waste than to limit communications strictly to 
the Bk£jv needed. There are two reasons for this: first, effort would need to be put into generation of the 
indexing needed; second, the length of the packets communicated would be reduced, so that latency would 
slow the calculations. 

In order to implement these ideas, we must now consider the two kinds of indexing needed: first, the 
indexing needed to identify the atom pairs associated with the _B-data; second, the indexing needed to 
accumulate the C-matrix. In order to develop an overview of the indexing question, we first discuss schemes 
for labelling atoms. 

2.4 The labelling of atoms 
2.4.1 General ideas about labelling 

With the partition scheme outlined above, matrix multiplication in a parallel linear-scaling code will need 
at least the following five ways of labelling atoms: 

1. Global labelling. As the atoms move around, and responsibilities for atoms are passed from one node 
to another, every atom in the entire system needs its own unique label. We call this scheme 'global 
labelling'. The natural way is to label the atoms from 1 to N to t, where iV to t is the total number of 
atoms. 

2. Partition labelling. Since atoms are organised into partitions, we can refer to them by saying 'the 
nth atom in the pth partition'. To ensure that all nodes use the same scheme, we adopt a standard 
order for enumerating the partitions of the entire system, and a standard order for the atoms in each 
partition - the latter can be in increasing order of global label, for example. We refer to this scheme 
as 'partition labelling'. 

3. Local labelling. Each node needs to know only about a subset of atoms: the atoms in its primary set, 
plus the atoms on other nodes that are in range of the matrices it has to deal with. Strictly speaking, 
it needs to know only about atoms in the halo associated with the largest cut-off radius. But in order 
to construct the haloes, it will need to find out about a somewhat larger set, which we call the 'grand 
covering set' (GCS); the number of atoms in this set is denoted by Agcs- We label the atoms in the 
GCS sequentially from 1 to Agcs> and call this scheme 'local labelling'. 

4. Halo labelling. When looping over atoms in a particular halo - for example the A-halo when we do 
matrix multiplication - it will be convenient to label the halo atoms sequentially. We refer to this as 
'halo labelling'. 

5. Neighbour labelling. In order to store matrix elements economically, we shall want to store only 
those elements A, MJ v etc. for which the distance between atoms i and j is less than the cut-off Rx, 
since other elements vanish by definition. For each primary atom i, we shall run sequentially through 
A-neighbours j. This scheme for referring to neighbours j is called 'neighbour labelling'. Of course, 
the label given to an atom j in this scheme depends on the primary atom i. 



Note that in these general schemes there is much freedom in the order of the labelling. For example, 
in partition labelling, we can list the partitions in different orders. Similarly, in local, halo and neighbour 
labelling, we can list the atoms in different orders. We shall want to design the storage patterns so as to 
optimise the efficiency of the multiply- and- add operations, and so as to facilitate transcription between the 
different labellings. To see how this works in practice, we now study how to pass information between nodes 
and how to determine the storage locations in the C-array when accumulating the results of multiply-and- 
adds. 



2.4.2 Passing information between nodes 

When the matrix elements Bk^j v are passed between nodes, information about the identity of the B- 
ncighbours j of each atom k must also be passed, so that the receiving node can determine where to 
accumulate the elements of the product Cj„ j U . We have to ask what kind of labelling should be used to 
send the identity of j. 

Global labelling would clearly express the identity of j unambiguously, but there is an objection to using 
this. The receiving node will need to transcribe to the labelling it uses to refer to the storage locations 
of C, which is essentially C-neighbour labelling. While each node can transcribe from its own local, halo 
or neighbour labelling to global labelling using an amount of storage that is independent of the size of the 
system, transcription from global labelling to the other kinds demands a storage proportional to the size of 
the sytem, and this violates the requirement of linear-scaling behaviour. This may not always be a problem 
in practice, but it is certainly objectionable in principle. 

Partition labelling offers a simple way to overcome this. For each B- neighbour j, we specify its partition 
and its sequence number in the partition, with the partition identified by giving its offset in the three 
Cartesian directions relative to the partition containing k. The receiving node then uses the partition offset 
and sequence number to transcribe to its own local label for each j. The separate indexing needed by the 
receiving node to determine the storage location of C-elements is described next. 



2.4.3 Determining storage locations of C 

To achieve the best speed, one would use a pre-calculated index array to determine the storage location of 
each matrix element C M v. As we run sequentially through the B-neighbours j of each atom k for a given 
primary atom i, we would look up the storage location where this contribution has to be accumulated onto 
the C-array. This is what was done in an earlier version of the Conquest matrix- multiplication scheme |22j . 
The major objection to this is that the atom j in B^j v is most easily specified in neighbour labelling, in 
which case the pre-calculated index array will depend on a triplet of labels (i,k,j), so that the storage 
requirements of the index become exorbitant. 

Our answer to this is to abandon the use of the pre-calculated index, and instead to determine the storage 



locations of C-elements at run time. As explained in Sec. 2.4.2, it is straightforward to calculate the local 
label of j for each element B^ju, provided we pass the offset of the partition containing j relative to the 
partition containing k in three Cartesian directions. Our procedure is then to transcribe from this local label 
to the C-halo label of j at run-time in the multiplication kernel, and then to look up the address of each 
C^jv in a pre-calculated index array which takes as input the primary label i and the C-halo label j. Since 
this index array depends on only two labels, its storage requirements are moderate. 



2.5 Detailed coding of the multiplication kernel 

The principles outlined above are embodied in the F90 coding of the multiplication kernel displayed in Fig. |[ 
This calculates and accumulates the contributions to the C-matrix from >l-halo atoms k in a given A-halo 
partition K, the latter being specified in the code by the variable kpart. Note that the outer loop in the 
kernel goes only over A-halo atoms k in K (the number of these being ahalo°/ n nh_part (kpart) ), and not over 
all atoms in K. The code inside the outer loop consists of three sections: the three lines of the first section 
set up useful variables associated with atom fc; the purpose of the second section, consisting of a single loop 
over atoms j, is to construct the table jbnab2cli( j ) which transcribes from _B-neighbour labelling to C-halo 
labelling of the B-neighbour atom j; the third section is a double loop over primary A-neighbours i and 
S-neighbours j of the atom fc, within which the multiplication and accumulation are performed. We now 
comment on the three sections. 

In the first section, k_in_halo is the A-halo label of atom fc; the array ahalo°/ j_beg (kpart) gives us the 
A-halo label of the first ^4-halo atom in the partition K . The variable k_in_part is the sequence number of 
atom k in K; this means the number in the sequence when all atoms in K are counted, and not just yl-halo 



atoms. Finally, nbkbeg is the address where ^-neighbours of atom k start in the array b() holding rows of 
the i3-matrix for the current partition K. 

In the second section of the kernel, we loop over all B-neighbours of the current atom k. The variables 
jpart and jseq in this loop are the partition number and the sequence number within its partition of 
neig hbo ur j. (The variable k_of f is connected with periodic boundary conditions, and will be explained in 
Sec. 2J3). The two quantities jpart and jseq together consistute the 'local' label of atom j in the partition 
labelling scheme. The array jbnab2ch(j) transcribes from the B-neighbour label j to the C-halo sequence 
number, and plays a key role in what follows. 

The outer loop in the third section goes over the at7 nJmab(k_in_lialo) primary-set ^4-neighbours i of 
fc, nabeg is the address in the array a() where the corresponding element of A is stored, i_in_prim is the 
sequence number of i in the primary set, and icad is a starting address needed later. The inner loop goes over 
all the nbnab(k_in_part) B-neighbours of k, and nbbeg is the address in the array b() holding elements 
of B. We then use the table jbnab2ch(j) constructed in the second section to transcribe from j to the 
C-halo label. The pre-calculated table chalo%i_ti2d(icad+j_in_halo) is then used to look up the address 
in c() where the product will be stored. The if statements that test the values of j_in_halo and ncbeg are 
redundant in the maximal case Rc > Ra + Rb, since they can never be satisfied, but they are needed for 
the general case, as explained later. The triple loop over n3, nl and n2 that performs the multiplication of 
n x n matrices for each atom triplet (i,k,j) is self-explanatory. 



2.6 Periodic boundary conditions 

With periodic boundary conditions (pbc), the system we study is an infinite Bravais lattice with a (generally 
large) basis. The system is invariant under translation by an Bravais lattice vector. For book-keeping 
purposes, we regard one of the primitive unit cells of the Bravais lattice as the 'fundamental' cell. The 
elements of every matrix X^jv are stored for all i in the fundamental cell. Now the cut-off radius Rx of 
any matrix is not to be constrained in any way by the dimensions of the repeating cell, so that the atoms j 
for which Xi^j^ is non-zero can be in the fundamental cell or in any image of this cell. For given i, there 
can be different atoms j, j' having non-zero values of Xi a j$ which are periodic images of each other. 

The use of pbc does not disturb the code structure already established, provided one adopts the right 
viewpoint. Every node knows about atoms in its primary set, and also about atoms in its grand covering 
set (GCS), which contains all the haloes that it needs. All atoms i in its primary set can be regarded as 
belonging to the fundamental cell, so that none is a periodic image of any other. But atoms in its GCS may 
be periodic images of each other. For most purposes, all the atoms in the GCS and in all the haloes are 
regarded as completely distinct, irrespective of whether they are periodic images or not. 

To ensure efficient and correct operation of the code with pbc, two steps must be taken. First, as we loop 
over ^4-halo partitions K , we note that for two such partitions that are periodically equivalent, the sets of 
Bk£,jv data for the two partitions are identical, and we clearly do not wish to communicate the same data 
more than once. We avoid this by insisting that the order in which A-halo partitions are passed through is 
such that any partition and its periodic images in the halo are grouped together contiguously. Then for each 
A-halo partition, we check whether it is periodically equivalent to the previous one; if it is, then nothing is 
communicated from the remote node, and we use the _B-data most recently communicated. The loop over 
A-halo partitions is outside the multiplication kernel, so this does not affect the code shown in Fig. ||. 

The second step needed is to take account of the periodic offsets when calculating the storage locations 
where elements of the C-matrix are accumulated in the multiplication kernel. The B-matrix data supplied 
to the kernel include a label identifying the partition to which each i?-neighbour atom j belongs: this is 
given in the array ibpart () - see Fig. ||. But this label must be modified to account for the periodic offset of 
partition K. Provided we use a suitable labelling of partitions in each node's GCS, the modification consists 
of adding a single offset parameter k_of f to the partition label, this parameter being passed as an argument 
to the multiplication kernel. This addition to the first statement in the j-loop (see Fig. ||) is the only place 
in the kernel that pbc appear explicitly. 



2.7 Practical construction of indexing arrays 

The multiplication kernel needs certain indexing arrays and transcription tables, and we outline briefly the 
principles used in constructing these. 

In making all the labelling, a key role is played by the grand covering set (GCS) mentioned above. In 
general, a 'covering set' is a set of partitions that contains all the neighbours of an atom, or the primary-set 
halo, associated with a given cut-off radius; it is used in constructing lists of neighbours and halo atoms. 
The GCS is the largest such covering set, so that it can be used for making all the neighbour and halo lists. 



Each node has its own GCS. It is convenient to choose the GCS to be orthorhombic in shape, because this 
simplifies the labelling. For any given cut-off radius Rx, the node constructs its neighbour lists by passing 
sequentially through all atoms in its primary set and all atoms in its GCS, calculating the interatomic 
distances, comparing with Rx, and making a record of the neighbours. This record is then used to make 
the corresponding halo list. 

The ordering of the lists is important, so we need to consider the order for enumerating GCS partitions. 
In fact, we need to consider two different orderings. In the first, we label GCS partitions using the obvious 
Cartesian system. The GCS is orthorhombic and the numbers of partitions along its three edges are called 
Li, L 2 , L 3 , so that any partition can be labelled by the triplet {h,h,h), with 1 < l s < L s (s = 1,2,3). 
Then we define the label A = Qx — 1)^2-^3 + (h — l)-^3 + h, which we call the 'Cartesian composite' (CC) 
label. 

CC ordering is not appropriate for enumerating A-halo partitions when fetching B-data from remote 
nodes, because for this we should group partitions together by their home node and (if relevant) according 
to periodic images (see Sec. |2.6| ). To make a suitable ordering for this purpose, we adopt a standard order 
for enumerating nodes, and a standard order for primary partitions on each node. Then a node enumerates 
the partitions of its GCS by ordering them with respect to: (i) periodic images; (ii) partitions on each node; 
(iii) node. It passes most rapidly through periodic images, so that all images of a given partition are listed 
contiguously; next most rapidly through partitions on each node, so that all partitions on each node are 
contiguous; and least rapidly through nodes. It is convenient to start with the local node itself, then to go to 
the next node occurring in the GCS in increasing order of the standard node enumeration, wrapping round in 
this enumeration as necessary. We call this ordering of GCS partitions node-order periodic-grouped (NOPG). 
This is the GCS ordering used for forming the A-halo, so that when we pass through A-halo partitions we 
are implicitly performing a triple loop over halo-nodes, halo-partitions on each node, and periodic images of 
each halo-partition - the latter being needed to avoid unnecessary communication of i?-data, as explained 



in Sec. 2.6 



2.8 The minimal case 

So far, we have restricted outselves to the 'maximal' case Rc > Ra + Rb- But, of course, we need to handle 
general cut-off radii. As a first step towards doing this, we examine now the case Rc <| Ra — Rb \, which 
we call the 'minimal' case. We shall show that a trivial modification of the code already presented for the 
maximal case gives an efficient code for the minimal case. To see this, we must first outline an important 
general principle. 

For the superscalar machines that are our main interest, the speed is limited mainly by transfers between 
memory and registers via the cache: the operations performed on the data once it is in registers have little 
effect on the speed[Q. So let us focus on storage and transfer patterns, and ignore the register operations. 
In the maximal case, one set of data (the A^^ matrix elements) are in local memory and transfer occurs in 
a regular and predictable way; data that behave like this will be called X . A second set of data (the B^j U 
elements) is fetched from remote nodes, but again is transferred in a regular way; this kind of data will be 
called Y . Finally, a third set of data (the Ci^.j V elements) is in local memory, but its transfer occurs in an 
irregular way and addressing information has to be calculated for it at run time. Data like this will be called 
Z . The appropriate data storage and transfer patterns for the minimal case are obtained by identifying C as 
X, B as Y and A as Z, where B denotes the transpose of B - whereas in the maximal case A was X, B was 
Y and C was Z . In the minimal case the multiply- and- adds needed are X = Z ■ Y — whereas in the maximal 
case we did Z = X ■ Y. As far as the multiplication kernel is concerned, it is transferring certain patterns of 
data from memory and doing certain things with it in registers. Provided the data passed to the kernel and 
the operations performed with it in registers are changed appropriately, the maximal and minimal cases are 
exactly equivalent. 

It follows from this that the multiplication kernel for the minimal case is coded exactly as in Fig. ^, 
except that the statement within the loops over n2, nl and n3 must be replaced by the following statement: 

a(n3 ,nl ,nabeg)=a(n3 ,nl ,nabeg)+& 
c (nl , n2 , ncbeg) *b (n3 , n2 , nbbeg) 

The structure of all the indexing arrays remains exactly as before. Note that in the minimal case, as in the 
maximal case, no tests of cut-off distances are needed, and the if statements in the kernel are redundant. 



2.9 General cut-off radii 



To treat general cut-off radii, we consider first the case Rc < Ra + Rb, where Rc is only a little smaller 
than the sum of Ra and Rb- This case is efficiently handled by using the code for the maximal case (see 
Sec. 2.5), with the j-loop going over all B-neighbours of k, but with an if-statement inserted so that the 
multiply-and-adds are not done if j is not a C-neighbour of i. Needless to say, there must not be any 
question of calculating interatomic distances in the multiplication kernel itself. However, using the arrays 
which transcribe from local labelling to C-halo labelling, and which give the storage address in the C-array 
for each pair, we can determine easily whether j is a C-neighbour of i. The convention we use is that the 
appropriate array elements have the value zero if the atom j is not in the C-halo or is not a C-neighbour; the 
test is then applied by the two if statements in Fig. [| We call this way of treating the case Rc < Ra + Rb 
'weak reduction', the adjective 'weak' referring to the fact that it will work well only if Rc does not fall 
too much below Ra + Rb- This loss of efficiency caused by the wasted passes through the do-loop over j is 
estimated in the Appendix. 

If Rc is very small, it will be more efficient to perform the multiplication by 'weak extension', i.e. by 
using the kernel for the minimal case, but with two if -statements i nse rted exactly as for the maximal kernel 



(using the symmetry between the two cases identified above in Sec. |2.8[) . The effect of wasted passes through 
the j-loop is estimated in the Appendix. 

Once the if-statements have been put into the maximal or minimal code, any case can be treated with 
either kernel. We will obtain correct results even if we treat the minimal case using the maximal kernel, or 
vice versa. Nevertheless, it is clearly more efficient to treat weak reduction using the maximal kernel and 
weak extension using the minimal kernel. But how, in general, can we tell whether it is more efficient to use 
the maximal or minimal kernel? This question is also addressed in the Appendix, where we show that the 
maximal kernel should be used when Rc > Ra and the minimal kernel when Rc < Ra- 



2.10 Load balancing 

However efficient the on-node operations and the communications may be, the overall performance will be 
poor unless the work load is balanced between processors, so that no processors are idle while others are 
working. The partition scheme gives us a natural way to achieve load balancing: our aim must be to group 
the partitions into bundles, with each bundle containing the primary set given to a node, in such a way 
that all nodes take roughly the same computation-plus-communication time, with this time being as small 
as possible. We plan to report in more detail elsewhere on load balancing in the Conquest code, and here 
we give only a brief summary of our methods. 

Our approach is to derive an approximate formula for the 'cost' of the calculation, and to use simulated 
annealing to search for a way of organising partitions into bundles which comes close to minimising the cost. 
This search is done by a separate code before the matrix multiplication code is run. Let t c ° mp and t9 omm be 
the times taken by node i to perform on-node computation and inter-node communication, respectively, and 
let ti = tf mp + (™ m be the sum of the two. (We assume that computation and communication cannot be 
overlapped, and that the former has to wait for the latter.) Then we have to minimise the 'cost function' T 
given by T = maxj(ti); this quantity T is the largest of all the ti values. 

We can associate a computation time r p with each partition p. This time is proportional to the number 
of i, k,j triplets for which atom i is in partition p. The proportionality constant can be estimated from the 
compute rate of the processors and the cpu efficiency, whose determination for any processor is described in 
Sec. |4| The value of ^ omp is then the sum of r p over the partitions in the bundle assigned to processor i. 
The value of t c ° mrn is found by dividing the number of B^jv elements that node i fetches by the practical 
inter-node transfer rate, with an allowance for latency. 

We follow the usual methods of simulated annealing J3lj] . Let F label a given grouping of partitions into 
bundles - this consists of a list of the partitions in every bundle - and let Tr be the value of T for this 
grouping. Then we sample the groupings F in such a way that the probability of finding T is proportional 
to a Boltzmann function exp(— Tp/O), where 9 is a fictitious temperature. The value of 9 is systematically 
reduced towards zero during the simulation, so that the only surviving T's are those whose Tr's are close to 
the minimum possible. To sample the space of groupings T, we use an algorithm to step from one T to the 
next. As usual in simulated annealing, the new grouping is accepted without question if Tr decreases; it is 
accepted with probability exp(— ATp/9) if Tr increases, where ATp is the increase. 

The choice of algorithm for stepping from one T to the next is not trivial. A significant problem arises 
from the rather curious definition of the cost function as the maximum of U values. As we search through 
groupings T, the time Tr changes only if we make reassignments of partitions affecting the particular bundle 



having the maximum £j. For large numbers of processors, this can make equilibration of the 'thermal' 
ensemble very slow. The measures taken to overcome this will be described elsewhere. 

3 Coding considerations 

One of the key aims when coding Conquest was to ensure portability as well as performance. Accordingly, 
the code was written in F90 with all communications written using MPI1 calls (though, as explained later, 
consideration was also given to rewriting critical communications with high speed, system specific calls, such 
as shmem on the Cray T3E). F90 improves the readability and structure of the code through use of modules 
and derived types, while retaining enough similarity to Fortran77 to be widely understood by the scientific 
community. 

The communications are key to the efficiency of the code. There are two types of object to be commu- 
nicated: the matrix elements themselves and the accompanying indexing. Efficient communication of the 
matrix elements presents no problems, since these form contiguous arrays, and are passed in relatively large 
portions (all elements for all the atoms in a partition K). The communication of the indexing, however, 
raises other problems. 

The indexing for a matrix is all enclosed in a single derived type, type (matrix) , of which only part 
needs to be communicated. The derived type is defined so that it contains all the data for all the atoms in a 
partition / (and therefore an array of these types is defined to hold the data for a node's primary set). One 
possibility for the communication of the indexing would be to use the MPI1 facilities for registering a new 
type and to pass individual partitions in a single call for that new type. However, the implementation of 
communication of defined types within MPI1 leaves room for loss of efficiency, and another way was found. 
Essentially, those indexing arrays which need to be communicated arc defined within the type as pointers, 
and point to different parts of an integer array. Then the communication of all relevant indices for a partition 
can be made via a single MPI call, passing a single, large integer array. Pointers are then put into place 
at the receiving node to access the indices. Both the matrix element arrays and this large integer array are 
declared as global variables so that on the Cray they will be symmetric, and shmem calls can be used on 
them. 

While the code has been written within MPI1, the communication pattern is naturally a one-sided one 
(since individual nodes require data for partitions at disparate times). We have developed a quasi-one- 
sided scheme within MPI1, as well as using truly one-sided schemes where available (e.g. under MPI2, 
which is not yet sufficiently widely supported to be truly portable, or under shmem on the Cray T3E or SGI 
Origin2000). At the start of the matrix multiplication, each node issues a series of non-buffered, non- blocking 
sends (assuming, quite reasonably, that no implementation of MPI1 would buffer the MPI_issend call). As 
partition data is required, nodes issue MPI_recv or MPI_irecv calls (as appropriate) to obtain the data. This 
scheme works extremely well, and shows good performance and scaling. It is easy to overlap communication 
and calculation by issuing MPI_irecv calls ahead of time. 

4 Practical tests 

We want to test the scaling properties and the efficiency of the code. As emphasised in our earlier tests 
on the Conquest code p2| , there are different types of scaling. For given cut-offs Ra, Rb, Rc-, the cpu 
and memory needed to do the multiplication should be proportional to the number of atoms: this is called 
'intrinsic scaling'. For a given calculation on a given number of atoms, one hopes to find cpu and memory 
inversely proportional to the number of processors: 'strong parallel scaling'. Finally, we can examine the cpu 
and memory as the numbers of atoms and processors are scaled up with the number of atoms per processor 
held fixed: 'weak parallel scaling'. There is a relation between the three scaling types, so that we can assess 
the scaling behaviour completely by studying weak and strong parallel scaling, and we shall present results 
on this. Good scaling does not necessarily mean good efficiency. For example, if the on-node operations 
were very inefficient, communications could be rendered negligible, so that parallel scaling would appear 
very good, even though the underlying performance was poor. We shall therefore pay special attention to 
cpu efficiency. 

Our main tests are on completely random arrangements of atoms with periodic boundary conditions; 
the position of each atom in the cell is created by using a random-number generator to draw its x, y and 
z coordinates from a uniform probability distribution. We choose to use random positions rather than, 
for example, perfect-crystal positions, because the randomness should provide more of a challenge to load 
balancing, especially for small numbers of atoms per node. In order to make contact with the real world, we 



take a mean density of atoms equal to that of diamond-structure silicon, and we use values of Ra, Rb and 
Rc typical of those needed in real first-principles calculations with the the Conquest code. The values of 
the matrix elements are irrelevant for these tests, since we are only concerned with the time taken to perform 
the operations; we have actually taken their values to be a simple radial function of interatomic separation. 
The number of indices on each atom is taken to be n = 4. 

We start by presenting results obtained on the Cray T3E-1200E using shmem communications. We 
examine the maximal case (Rc = Ra + Rb) with Ra = 8.46 A, Rb — 4.23 A. This corresponds roughly to 
performing the multiplication L ■ S in Conquest (L is the auxiliary density matrix, S the overlap matrix). 
Fig. U shows results from our tests of weak parallel scaling with 80 atoms per processor and processor 
numbers going from 2 to 250 (numbers of atoms from 160 to 20,000). The simulation cell is always a cube 
in these tests, and the partitions are also taken to be cubic. The partitions always have the same size, and 
the average number of atoms per partition is 20. To obtain the timings, we put a timer in the code running 
on each node, and we report the maximum and minimum times, together with the average time. (The times 
include all communication, but they deliberately exclude the construction of the indexing arrays needed 
as input to the multiplication kernel; this is justified, since in practice these arrays are reconstructed only 
when the atoms move, whereas many matrix multiplications are done for each set of atomic positions.) The 
following points are noteworthy. First, the spread between minimum and maximum times is small, so that 
the load balancing is good. In practice, the time taken is governed by the time of the slowest node. Since 
the maximum time is only ~ 6.4 % greater than the average, little would be gained by working harder on 
load balancing, at least for random atomic arrangements. Second, the times are essentially constant, the 
time on 250 nodes being only ~ 4 % greater than that on 16 nodes. This means that the size of system that 
could be treated is limited only by the number of nodes available: there is essentially no limitation from the 
parallel behaviour of the code itself. 

Tests of strong parallel scaling performed for the same Ra and Rb values (maximal case) with a total of 
4096 atoms in the whole cell are shown in Fig. ^, where we show the total time summed over the processors 
as function of processor number (which would be completely flat if the scaling were ideal). A useful way 
to assess the scaling is by comparing the times with Amdahl's law, which states that the time taken is 
t = to((l — x) + x/N proc ), where N proc is the number of processors, with x and (1 — x) the fractions of 
the calculation performed in parallel and serially. This formula fits the maximum times very well, and we 
find the value x — 0.9986, so that the serial fraction is 0.0014. This means that strong scaling is extremely 
good up to ~ 200 processors, or ~ 20 atoms per processor. Since it would be wasteful of memory to run 
Conquest with fewer atoms per processor than this, the implication is that strong scaling will be very well 
satisfied in practical calculations. 

We now turn to the minimal case (Rc <| Ra~Rb |)- We argued in Sec. 2.£ that the speed will be limited 
mainly by transfers to and from memory, rather than by operations in registers. This implies that, provided 
we use the minimal, rather than the maximal, multiplication kernel, the timings for the multiplication with 
R A = 12.69 A, R B = 4.23 A, R c = 8.46 A should be almost the same as for R A = 8.46 A, R B = 4.23 A, 
Rc = 12.69 A. We have tested this explicitly by performing multiplication with Ra = 12.69 A, Rb = 4.23 A, 
Rc = 8.46 A on 4096 atoms with different numbers of processors. The results are also reported in Fig. ||. 
We see that the timings are very close to our results for the maximal case, though higher by ~ 16 %. This 
means that all our conclusions about weak and strong scaling will be the same for the minimal case. 

We now turn to cpu efficiency. Here, we are solely concerned with the execution speed of the multiplication 
kernel itself, so that communication time is excluded. As explained in the Introduction, efficiency is defined 
in terms of the minimum number of operations required to perform the multiplications of eqn ([!]). For a 
given node, this is the number of (i, k 7 j) triplets for i in that node's primary set, multiplied by n\ x X 112, 
multiplied by a factor of 2 to allow for the fact that we do a multiply and add for each (i/i,k^, jv). This 
minimum number of operations divided by the time spent in the multiplication kernel gives the 'useful' Mflop 
rate, which is then divided by the peak Mflop rate of the processor to give the cpu efficiency. 

We have done efficiency tests on both perfect-crystal and random atomic positions. As an illustration, 
we report results on perfect-crystal silicon (lattice parameter = 5.46 A) with cut-offs Ra = 6 A, Rb — 10 A, 
Rc — 16 A with 1728 atoms in the total system. The tests were done with 64 cubic partitions and 16 
processors on the Cray T3E-1200E, the partitions being distributed unevenly over the processors. The 
'useful' Mflop rate was found to be almost identical on all processors, the average being 101.4 Mflops, and 
minimum and maximum being 101.1 and 103.6 Mflops. The theoretical peak speed of the processor is 
1.2 Gflops, so that the efficiency is 8.4 %. This result is reasonably satisfactory in two ways. First, it 
shows that all the operations associated with transcription between labellings, the determination of storage 
addresses, etc cannot be taking a large amount of time (recall that these are not included in the number of 
'useful' operations); second, they show that fairly good use is being made of the cache. 



We have also performed tests on a number of other machines: a Beowulf linked by fast ethernet; a Beowulf 
linked by SCI interconnect; and a SGI Origin2000. We have tested weak scaling on all these systems to 
investigate the portability of the code, and the importance of efficient communications. The increase in time 
per node as the system size is increased is shown in Fig. ^. We remark that, given perfect communication, 
the line ought to be flat at 1.0, and that to all intents and purposes this is true for the Origin2000 and 
T3E. The SCI beowulf system performs well and should be adequate for Conquest calculations. The fast 
ethernet beowulf, however, shows far too large an increase. 

Finally, we assess briefly the implications of our new methods for practical calculations with the full 
Conquest code. The present matrix multiplication code has been incorporated into Conquest, and we 
have run tests in which the energy of the self-consistent ground state is calculated for silicon systems. In 
searching for the ground state, the basic step is the 'self-consistency iteration', in which an input charge 
density is used to calculate the Kohn-Sham Hamiltonian matrix, and the techniques of Ref. |Q are used 
to determine the L-matrix that yields the non-self-consistent ground state for the current Hamiltonian, and 
hence the output charge density. We search for self-consistency using the GR-Pulay technique p4j. 

We report in Fig. ^ timings for a single self-consistency iteration obtained for crystalline silicon containing 
numbers of atoms going from 4096 to 12288 when the calculations are run on 512 processors of the T3E- 
1200E. For these tests, we used a norm-conserving non-local pseudopotentialjss). The support-region radius 
i? rcg was equal to 2.715 A and the L-matrix cut-off, Rl equal to 6.9 A (for definitions of these cut-offs, see 
Ref. p4|); these values are typical of what would be used in practice. The Figure shows that the scaling with 
system size is so good that the time is actually a slightly sub-linear function of atom number. We believe 
that this is because, as the primary sets become larger, the number of nodes with which communication is 
required actually decreases, making the time required decrease. To put these results in context, we note that, 
starting from scratch, typically 10 self-consistency iterations are needed to reach the self-consistent ground 
state with acceptable accuracy for fixed support functions, and that a further factor of 10 should be allowed 
for variation of the support functions. This means that, with the methods presented here, the attainment of 
the full ground state with an accuracy comparable to that expected in a conventional plane- wave calculation 
can be accomplished for a 12,000-atom system in a few hours on a 512-processor T3E. 



5 Discussion and conclusions 

We have shown that our code for parallel multiplication of locally sparse matrices achieves two important 
things: first, it displays almost perfect linear scaling with respect to numbers of atoms and processors; 
second, it achieves good cpu efficiency. 

We remark that we have every right to expect the linear scaling to be perfect when the numbers of 
processors and atoms are increased together, with a constant number of atoms per processor (weak scaling): 
for this type of scaling, the amounts of computation and communication should, indeed, remain constant. 
What is less trivial is the type of scaling in which the number of processors is increased with the total 
number of atoms held fixed (strong scaling). In contrast to some previous schemes, the present code shows 
excellent strong scaling, at least on machines having a fast communications network. Good cpu efficiency is 
also non-trivial. 

A key idea in the present scheme is the management of the atoms in small compact groups, referred to as 
'partitions'. Although this certainly makes the code more complex, it has the enormous advantage of giving 
an internal structure to the calculation, which can be used in a number of ways: it gives a flexible means of 
constructing the primary sets that are assigned to processors; it enables us to manage the communications 
by breaking them into packets of convenient size; it does the same for the on-node computations, so that we 
work with a convenient 'multiplication kernel'; it provides a labelling scheme which allows nodes to identify 
to each other the atoms for which data is communicated; finally, it aids efficient cache use by improving 
data locality. It is important to note that the present scheme of partitions and bundles for managing atoms 
strongly resembles the scheme of blocks and domains for managing integration-grid points in Conquest 
described in our earlier work p2| . A further key concept in the present work is that of transcription 
between different atom-labelling schemes, which is crucial in reducing the computational overheads in the 
multiplication kernel. 



At the start of our analysis (Sec. 2.1), we made certain assumptions about the storage of matrix elements 
and the way the matrix multiplication should be done, and we now ask whether anything could be gained 
by changing these assumptions. Our answer is that little is likely to be gained. With the scheme we have 
presented, the scaling properties are already close to ideal, and there is little room for improvement. It 
is true that our cpu efficiency is only ~ 10 %, and there is clearly scope for improvement here. However, 
one should remember that efficiencies as high as 50 % on superscalar processors are achievable only in the 



multiplication of large non-sparse matrices, and that serious sparsity is bound to incur a serious loss of 
efficiency. Our suspicion is that an improvement of the cpu efficiency by as much as a factor of two is 
unlikely, and that any such improvements are more likely to come from work on the multiplication kernel 
than by changing the basic assumptions. Time will tell. 
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Appendix: Reduction and extension, maximal and minimal kernels 

As explained in the text, different multiplication kernels should be used for large and small values of the 
cut-off radius Rc of the product matrix C: for large and small Rc, we should use the maximal and minimal 
kernels respectively. In this Appendix, we discuss how to decide the cross-over point from the maximal to 
the minimal case as the radii Ra, Rb and Rc are varied. 

If we have particular values of Ra, Rb and Rc, then for each atom i in a node's primary set, there is a 
certain number of pairs (k,j) for which multiply-and-adds have to be done. Our algorithm for deciding the 
cross-over point will be based on estimates for this number of pairs, which we denote by v p . If we treat the 
problem as weak reduction, then we ignore the constraint supplied by Rc, and instead of the true number of 
pairs v p we actually visit a larger number of pairs, which we call i^™ ax . The merit of doing it like this can be 
gauged by the ratio ^ max = v p /v^ ayi . If this ratio becomes too small, then it is wasteful to treat it as weak 
reduction. Similarly, if we treat as weak extension, we visit a number of pairs v ™ ln which will generally be 
greater than v p , so we can gauge the merit of this method by the ratio r/ min = i> p /i>™ ln . We should go for 
the method that gives the largest rj value, so the policy will be that for given values of Ra, Rb and Rc we 
treat it as weak reduction if ^™ ax < z/™ m and as weak extension if z/™ m < j/™ ax . 

The values of u pi j/™ ax and in practice will depend on the details of the atomic positions. We need 
a simple approximate way of estimating the numbers of pairs. To do this, we assume that the density of 
atoms is uniform, with p atoms per unit volume. Then our estimate for the number of pairs is: 



It is simple geometry to evaluate the integral. Let 0,(Ra, RB',r) be the overlap volume of two spheres of 
radius Ra, Rb when the distance between their centres is r. Then: 



It will be useful to note some symmetry properties of v p {Ra,Rb,Rc)- Since Q(Ra, Rb',t) is invariant 
under interchange of Ra and Rb, the same is true of v p . In fact, it is easy to show that v p is completely 
invariant with respect to all permutations of Ra, Rb and Rc- 




r 1 <R,A,r 2 <Rc, |r2— ri|<fl B 



dridr 2 



(2) 




(3) 



v v (R a ,Rb,Rc) = v v {Rc,Ra,Rb) = v p (Rb,Rc,Ra) 

= v p (R b ,Ra, Rc) = v p (Ra, Rc, Rb) = v v (Rc, Rb,Ra) 



(4) 



These symmetries very much simplify the calculation of v p for general values of Ra, Rb and Rc- Let R\, 
R 2 and i?3 denote the smallest, the middle and the largest of Ra, Rb and Rc- Then for arbitrary values of 
Ra, Rb and Rc, we can write: 

u p (R a ,R b ,R c ) = u p (R 1 ,R 2 ,R 3 ) . (5) 

There are two separate cases that must be considered, which we refer to as 'triangle' and 'non-triangle'. In 
the first case, we have i?3 < Ri + R 2 , so that the three lengths can form the sides of a triangle. In the 
second, we have R3 > R± + R 2 , and they cannot form the sides of a triangle. The non-triangle case is trivial, 
because the constraint supplied by R3 is of no effect, and we just have: 

v p (R u R 2 ,R 3 ) = pl / dndr 2 = —ir 2 p 2 RlR 3 2 . (6) 

Jri<Ri,r 2 <R2 " 

In the 'triangle' case, we return to eqn. (||) and note that the overlap volume of two spheres is: 

n(R lt Ra;r) = -ttE? if r < R 2 - R l 

= -^ { Rl-Rlf + ^ { Rl-Rl)-\nr(Rl + Rl) + ^ iir>R 2 -R x . (7) 
The integral over r is then straightforward, and we get: 

v p = n 2 P l (jRl(R 2 - R 1 ) 3 - \{R 2 2 - R\? [Rl - (R 2 - i? x ) 2 ] + + Rl) [JZ| - (R 2 i? x ) 3 ] 

- \{R\ + Rt) [Rj (R 2 R,) 4 } +^[Rl~ (Ra - i?i) 6 ]) ■ (8) 

It is now simple to find the cross-over point and the rj value at this point. The strategy for weak 
reduction is based on ignoring the constraint supplied by Rc- Then ^™ ax is just v p for the non-triangle case 
Rc > Ra + Rb, so that from eqn. (ra) we have: 



v. 



max _ ^„2 1 p.; !>■: 

p " 9 



tt'pZRaRb • (9) 



In weak extension, we ignore the constraint supplied by Ra- Then z/™ ln is v p for the non-triangle case 
Ra > Rb + Rc, so we have: 

16 



min fT7 . 2 „ 2 £>3 c>3 

p "9 



-tt 2 p 2 R b R c - (10) 



The cross-over point is the point at which j/™ ax = i/™ ln , which requires that Ra = Rc- ln terms of the 
dimensionless ratio a = Rc/(Ra + Rb), we can express the cross-over point as: 

a x = 1/(1 + 0, (11) 

where £ = Rb/Ra- 

The efficiency rjx = v p ia ~ x /v p — v™™ jv p at the cross-over point can now be found from eqn. (||). The 
derivation of the resulting formula: 

r?X = l-— £+— £ 3 (12) 
' 16^ 32^ y ' 

is straightforward. 

In practice, we should always try to ensure that Ra > Rb, so that < £ < 1. It is readily shown that 
ijx increases montonically from 15/32 to 1 as £ decreases from 1 to 0. This means that the efficiency cannot 
drop below 15/32, or, in round numbers, 50 %. 



Loop over remote nodes 
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Loop over sets of atoms, k 



j 

Fetch B data 



I 

Perform multiply-and-add 



End loop over sets of atoms, k 



I 

End loop over remote nodes 



Figure 1: Interleaving of communication and calculation in parallel multiplication of sparse matrices. 



Figure 2: Schematic illustration of partitions and haloes. Atoms are indicated by small, black circles, and 
partitions by the dashed lines. A particular atom, i, is patterned. Its A-neighbours are those atoms that are 
within or touched by the circle drawn centred on i. If the primary set is chosen to be the partition containing 
i, then the A-halo partitions are those enclosed by the heavy line. 



do k=l,ahalo7 nh_part(kpart) ! Loop over atoms k in current A-halo partn 

! useful variables associated with atom k 

k_in_halo=ahalo°/ j_beg(kpart)+k-l 
k_in_part=ahalo%j_seq(k_in_halo) 
nbkbeg=ibaddr (k_in_part) 

! transcription of j from B-neighbour to C-halo labelling 

do j=l ,nbnab(k_in_part) 

jpart=ibpart (nbkbeg+j-l)+k_of f 

j seq=ibseq(nbkbeg+j-l) 

jbnab2ch(j )=chalo°/,i_halo(chalo°/,i_hbeg(jpart)+jseq-l) 
enddo 

! perform multiply and adds 

do i=l , at°/ n_hnab(k_in_halo) ! Loop over primary-set A-neighbours of k 

nabeg=at°/ i_beg(k_in_halo)+i-l 

i_in_prim=at / i_prim(at /oi_beg(k_in_halo)+i-l) 

icad=(i_in_prim-l) *chalo°/ ni_in_halo 

do j=l ,nbnab(k_in_part) ! Loop over B-neighbours of atom k 
nbbeg=nbkbeg+ j - 1 
j_in_halo=jbnab2ch(j) 
if ( j _in_halo/=0) then 

ncbeg=chalo°/ i_h2d(icad+j_in_halo) 

if (ncbeg/=0) then ! multiplication of ndim x ndim blocks 
do n2=l,ndim3 
do nl=l,ndiml 
do n3=l,ndim2 

c(nl,n2,ncbeg)=c(nl,n2,ncbeg)+ & 
a (n3 , nl , nabeg) *b (n3 , n2 , nbbeg) 
enddo 
enddo 
enddo 
endif 

endif ! End of if (j_in_halo.ne.O) 
enddo ! End of 1 , nbnab 
enddo ! End of 1 , at°/ n_hnab 
enddo ! End of k=l , ahalo°/ nh_part (kpart) 
return 



Figure 3: The matrix multiplication kernel. 
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Figure 4: Time recorded on each processor for the maximal case of sparse-matrix multiplication. Numbers 
of processors and atoms are varied with average number of atoms per processor equal to 80 in all cases. 
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Figure 5: Times t recorded on each node for maximal (solid line, circles) and minimal (dashed line, squares) 
cases of sparse matrix multiplication when number of processors N pTOC is varied with total number of atoms 
in the system equal to 4096 in all cases. Quantity plotted is the sum of the times on all processors (t.N pTOC ), 
which would be exactly constant for perfect strong parallel scaling. 
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Figure 6: Average time per processor as a multiple of time on two processors for constant number of atoms 
per processor (equal to 32). Maximal case for crystalline silicon with cut-off radii Ra = 12A, Rb = 8Aand 
R c = 20A. 
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Figure 7: Time for one self-consistency iteration (see text) of the Conquest code as a function of number 
of atoms of perfect-crystal silicon. Calculations were run on 512 processors of a Cray T3E-1200E. Points 
connected by solid line show elapsed time in seconds; dashed line shows time scaled linearly from the actual 
time for 4096 atoms. 



